Final project

  • Using knowledge from this course, analyze your chosen dataset again (the one used to demonstrate visualization methods). Apply at least one data analysis method learned in the course to this dataset. Submit the report to the teacher in Moodle for verification.

US Cars Dataset

Online Car Auctions in North America


Context

  • The car data was scraped from AUCTION EXPORT.com. The dataset contains information about 28 brands of clean and used vehicles for sale in the US; twelve features were assembled for each car.


  • This dataset includes 12 features:
In [138]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px  # used by the treemap, pie, sunburst and animated bar charts below
import researchpy as rp
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
df = pd.read_csv('C:/Users/As/Desktop/123/USA_cars_datasets.csv')
df.drop('Unnamed: 0',axis = 1, inplace = True)
df.head()
Out[138]:
price brand model year title_status mileage color vin lot state country condition
0 6300 toyota cruiser 2008 clean vehicle 274117.000 black jtezu11f88k007763 159348797 new jersey usa 10 days left
1 2899 ford se 2011 clean vehicle 190552.000 silver 2fmdk3gc4bbb02217 166951262 tennessee usa 6 days left
2 5350 dodge mpv 2018 clean vehicle 39590.000 silver 3c4pdcgg5jt346413 167655728 georgia usa 2 days left
3 25000 ford door 2014 clean vehicle 64146.000 blue 1ftfw1et4efc23745 167753855 virginia usa 22 hours left
4 27700 chevrolet 1500 2018 clean vehicle 6654.000 red 3gcpcrec2jg473991 167763266 florida usa 22 hours left

Description of Cars

In [13]:
df.describe()
Out[13]:
price year mileage lot
count 2499.000000 2499.000000 2.499000e+03 2.499000e+03
mean 18767.671469 2016.714286 5.229869e+04 1.676914e+08
std 12116.094936 3.442656 5.970552e+04 2.038772e+05
min 0.000000 1973.000000 0.000000e+00 1.593488e+08
25% 10200.000000 2016.000000 2.146650e+04 1.676253e+08
50% 16900.000000 2018.000000 3.536500e+04 1.677451e+08
75% 25555.500000 2019.000000 6.347250e+04 1.677798e+08
max 84900.000000 2020.000000 1.017936e+06 1.678055e+08

Observations:

  • The minimum price is zero, which cannot be true; these values need to be replaced with an appropriate value.
  • The maximum price in this dataset is $84,900.
  • The oldest car is a 1973 model.
  • The first-quartile (25%) year is 2016, which indicates that more than 75% of the cars are recent models built in 2016 or later.
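The zero-price replacement suggested above could be done, for example, with the median of the non-zero prices. A minimal sketch on toy values (hypothetical, not the actual dataset):

```python
import pandas as pd

# Toy prices (hypothetical); zeros stand in for the invalid listings.
prices = pd.Series([0, 6300, 2899, 0, 25000])
# Median of the non-zero prices only, so the zeros do not bias it.
median_nonzero = prices[prices > 0].median()
# Replace the zeros with that median.
cleaned = prices.mask(prices == 0, median_nonzero)
print(cleaned.tolist())
```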
In [17]:
fig = px.treemap(df, path=["brand",'model','color'],
                  color='brand', hover_data=['model'],
                  color_continuous_scale='rainbow')
fig.show()

Output summary statistics by groups - count, mean, sd

In [14]:
rp.summary_cont(df[['year','mileage','lot','price']])

Out[14]:
Variable N Mean SD SE 95% Conf. Interval
0 year 2499.0 2.016714e+03 3.442656 0.068867 2.016579e+03 2.016849e+03
1 mileage 2499.0 5.229869e+04 59705.516356 1194.349221 4.995667e+04 5.464070e+04
2 lot 2499.0 1.676914e+08 203877.210516 4078.359964 1.676834e+08 1.676994e+08
3 price 2499.0 1.876767e+04 12116.094936 242.370378 1.829240e+04 1.924294e+04

Correlation

In [180]:
corr = df.corr()
corr.style.background_gradient(cmap='magma')
Out[180]:
price year mileage lot
price 1 0.418274 -0.400838 0.159369
year 0.418274 1 -0.594686 0.164582
mileage -0.400838 -0.594686 1 -0.130475
lot 0.159369 0.164582 -0.130475 1

Observations:

  • Price and model year have a positive correlation, i.e. the newer the car, the higher the price.
  • Price and mileage have a negative correlation, i.e. as the miles traveled by the car increase, its price decreases.
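For reference, the Pearson coefficient behind df.corr() is the covariance of two columns divided by the product of their standard deviations. A sketch on toy mileage/price numbers (not the dataset itself):

```python
import numpy as np

# Toy mileage/price values chosen to show a negative relationship.
mileage = np.array([274117.0, 190552.0, 39590.0, 64146.0, 6654.0])
price = np.array([6300.0, 2899.0, 5350.0, 25000.0, 27700.0])
# Pearson r = cov(x, y) / (std(x) * std(y)), with sample (ddof=1) estimates.
r = np.cov(mileage, price)[0, 1] / (mileage.std(ddof=1) * price.std(ddof=1))
print(r)
```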

Subdivision by brand

Most expensive brand

In [128]:
temp = pd.DataFrame(df.groupby(['brand']).count()['vin'])
temp.sort_values('vin', ascending = False, inplace = True)
# temp[temp['vin'] > 10].sum().values / temp.sum().values == 0.98239588
# temp[temp['vin'] > 10].count().values / temp.count().values == 0.48148148
brand_list = temp[temp['vin'] > 10].index.values
In [130]:
av_prices = []
for i in brand_list:
    x = df[df['brand']==i]
    av_price = sum(x.price)/len(x)
    av_prices.append(av_price)
data = pd.DataFrame({'brand_list': brand_list,'av_prices':av_prices})
new_index = (data['av_prices'].sort_values(ascending=False)).index.values
sorted_data = data.reindex(new_index)

sns.barplot(y=sorted_data['brand_list'], x=sorted_data['av_prices'], palette = 'GnBu_d')
plt.xlabel('Average Price ($)', fontsize = 14)
plt.ylabel('Brand', fontsize = 14)
plt.title('Average price per brand', fontsize = 16)
plt.show()
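The per-brand averaging loop above can also be written as a single groupby. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy data standing in for the cars frame.
toy = pd.DataFrame({'brand': ['ford', 'ford', 'dodge'],
                    'price': [10000, 20000, 9000]})
# Same result as summing and dividing per brand by hand.
avg = toy.groupby('brand')['price'].mean()
print(avg.to_dict())
```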
In [20]:
price = df.groupby('brand')['price'].max().reset_index()
price  = price.sort_values(by="price")
price = price.tail(10)
fig = px.pie(price,
             values="price",
             names="brand",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

Most frequent brand

In [133]:
counts = []
for i in brand_list:
    x = df[df['brand']==i]
    count = len(x.vin)
    counts.append(count)
data2 = pd.DataFrame({'brand_list': brand_list,'counts':counts})
new_index2 = (data2['counts'].sort_values(ascending=False)).index.values
sorted_data2 = data2.reindex(new_index2)

sns.barplot(y=sorted_data2['brand_list'], x=sorted_data2['counts'], palette = 'GnBu_d')
plt.xlabel('# of cars', fontsize = 14)
plt.ylabel('Brand', fontsize = 14)
plt.title('Number of cars per brand', fontsize = 16)
plt.show()
In [22]:
color = df.loc[:,["color"]]
color['count'] = color.groupby([color.color])['color'].transform('count')
color = color.drop_duplicates()
color = color.sort_values(by="count",ascending = False)
color = color.head(10)
fig = px.pie(color,
             values="count",
             names="color",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

Models within brands

In [151]:
car_brands = df.groupby('brand').agg(['count'])  # per-brand counts; reconstructed, this definition was missing from the original cell
sorted_cars = car_brands.sort_values(by=[('price','count')],ascending=False)
top_six = sorted_cars[0:6]
top_six
Out[151]:
price model year title_status mileage color vin lot state country condition
count count count count count count count count count count count
brand
ford 1235 1235 1235 1235 1235 1235 1235 1235 1235 1235 1235
dodge 432 432 432 432 432 432 432 432 432 432 432
nissan 312 312 312 312 312 312 312 312 312 312 312
chevrolet 297 297 297 297 297 297 297 297 297 297 297
gmc 42 42 42 42 42 42 42 42 42 42 42
jeep 30 30 30 30 30 30 30 30 30 30 30
In [158]:
plt.figure(figsize=(15,7))
plt.bar(x=top_six.index,height=top_six['price']['count'],alpha=.6,color=sns.color_palette('dark'))
plt.ylim(0,1500)
plt.title('Top Six Car Brands',fontsize=15,fontweight='bold')
plt.xlabel('Car Brand',fontsize=13,color='b')
plt.ylabel('Quantity',fontsize=13,color='b');
In [149]:
df['count'] = 1
brand_list = temp[temp['vin'] > 20].index.values
cars_b = df[np.in1d(df['brand'],brand_list)]
c_sun = px.sunburst(cars_b, path = ['brand','model'], values = 'count', color = 'price', 
            width = 800, height = 800, color_continuous_scale = 'twilight')
df.drop(columns = 'count', inplace = True)
c_sun.show()

Brand popularity per year

In [30]:
#top brand= Ford
perb= df.loc[:,["year","brand"]]
perb['count'] = perb.groupby([perb.brand,perb.year])['brand'].transform('count')
perb= perb.drop_duplicates()
perb= perb.sort_values(by="year",ascending = False)
top_brand = ['ford', 'dodge',"nissan",'chevrolet'] 
perb = perb.loc[perb['brand'].isin(top_brand)] 
perb = perb[perb.year>2015]
perb = perb.sort_values(by="year")

fig=px.bar(perb,x='brand', y="count", animation_frame="year", 
           animation_group="brand", color="brand", hover_name="brand")
fig.show()

Price of cars by model year

In [39]:
import plotly.express as px
fig = px.scatter(df, x="year", y="price", color="brand",
                 size='price')
fig.show()
In [21]:
import plotly.express as px
cars_by_model_year = df.groupby('year')['model'].count().reset_index().sort_values('model',ascending = False)
cars_by_model_year = cars_by_model_year[cars_by_model_year['year'] >= 2010]
cars_by_model_year = cars_by_model_year.rename(columns = {'model':'count'})
fig = px.bar(cars_by_model_year, x='year', y='count', color='count')
fig.show()
In [189]:
# ------------------------------------------------------
# PLOTTING THE BREAKDOWN OF THE MORE PREVALENT YEARS
# ------------------------------------------------------
fig = plt.figure(figsize=(12,5))
ax = fig.add_axes([0,0,1,1])
sns.countplot(x='year', data=df, ax=ax);

Price distribution of cars

In [76]:
import matplotlib.pyplot as plt
x_values = df.price
fig, ax = plt.subplots(figsize=(17, 7))
plt.scatter(x_values, range(len(x_values)))
plt.show()
In [75]:
fig, ax = plt.subplots(figsize=(17, 7))
ax1 = sns.distplot(df.price)
plt.axvline(np.mean(df.price), color='black', linestyle='dashed', linewidth=5);

Price vs Model Year

In [2]:
data = df[['price','year']]
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x='year', y="price", data=data)
  • As expected, the price of a car and its model year are proportional, i.e. the newer the model, the higher the price.

Pairplot

In [84]:
#scatterplot
sns.set()
sns.pairplot(df, height = 3)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()
  • A pairplot plots pairwise relationships in a dataset. The pairplot function creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column, producing the plots shown above.

Calculate T-Test

In [31]:
from scipy.stats import ttest_ind

def compare_2_groups(arr_1, arr_2, alpha, sample_size):
    stat, p = ttest_ind(arr_1, arr_2)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('Same distributions (fail to reject H0)')
    else:
        print('Different distributions (reject H0)')
In [26]:
from scipy.stats import f_oneway
from scipy.stats import ttest_ind
sample_size=15
ger_sampled = df.price
fr_sampled = df.mileage
compare_2_groups(df.price,df.mileage, 0.05, sample_size)
Statistics=-27.514, p=0.000
Different distributions (reject H0)
In [32]:
import scipy
scipy.stats.ttest_rel(df.price,df.mileage)
Out[32]:
Ttest_relResult(statistic=-25.58740862260307, pvalue=1.9029049491260088e-128)
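The two calls above differ in their pairing assumption: ttest_ind treats the samples as independent, while ttest_rel compares them element by element. A sketch on small fixed samples (hypothetical numbers, not the dataset):

```python
from scipy import stats

# Two small samples where b is roughly a shifted copy of a.
a = [10.0, 12.0, 9.0, 11.0, 13.0]
b = [12.0, 15.0, 10.0, 13.0, 16.0]
t_ind, p_ind = stats.ttest_ind(a, b)   # independent samples
t_rel, p_rel = stats.ttest_rel(a, b)   # paired, element by element
# The paired test exploits the consistent per-element shift,
# so it detects the difference at a smaller p-value here.
print(p_ind, p_rel)
```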
In [96]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(17, 7))
ax.hist(df.price,color='indigo');
ax.set_xlabel('Result');
ax.set_title('price of brand cars');
In [98]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(17, 7))
ax.hist(df.year,color='indigo');
ax.set_xlabel('Result');
ax.set_title('sale of cars group by year');

Assumptions of the F-test

In [14]:
import scipy.stats as stats
stats.f_oneway(df.price,df.mileage)
Out[14]:
F_onewayResult(statistic=757.0150214182844, pvalue=2.705711374302067e-155)

Linear Regression Model to determine the price of a car

In [34]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

np.set_printoptions(suppress=True)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn import linear_model

from sklearn.feature_selection import chi2


## function to get to know our data

def understand_variables(dataset):
    print(type(dataset))
    print(dataset.shape)
    print(dataset.head())
    print(dataset.columns)
    print(dataset.nunique(axis=0))
    print(dataset.describe())
    print(dataset.describe(exclude=[np.number]))
    
## function to deal with outliers
    
def outlier_processing(dataset):
    # Using IQR

    Q1 = dataset.quantile(0.25)
    Q3 = dataset.quantile(0.75)
    IQR = Q3 - Q1
    
    #outlier_col = ['year']
    
    print("\n-------------\n% of outliers\n")    
    print(((dataset < (Q1 - 1.5 * IQR)) |(dataset > (Q3 + 1.5 * IQR))).sum()/len(dataset)*100)
    
    for col in list(IQR.index): 
        dataset.loc[dataset[col] < (Q1 - 1.5 * IQR)[col],[col]] = (Q1 - 1.5 * IQR)[col]
        dataset.loc[dataset[col] > (Q3 + 1.5 * IQR)[col],[col]] = (Q3 + 1.5 * IQR)[col]
        
        dataset[col] = dataset[col].round(0).astype(int)
    
    return dataset
    
    

cars_dataset =df

# dropping unnecessary columns 

cars_dataset = cars_dataset.drop(["lot","vin"],axis=1)
understand_variables(cars_dataset)



################ Feature engineering ###########

######### convert year to age (2020 - year)
cars_dataset.year = 2020 - cars_dataset.year

######## condition column : [Listings expired = 0, remove 'left' from others, convert everything to minutess ]

cars_dataset.loc[cars_dataset.condition == "Listing Expired", 'condition'] = "0 minutes left"
cars_dataset['condition'] = cars_dataset.condition.str.replace("left","")
cars_dataset.loc[cars_dataset.condition.str.contains("minutes"),'condition'] = (cars_dataset.loc[cars_dataset.condition.str.contains("minutes"),'condition'].astype(str).str.split().str[0].astype(int)).astype(str)
cars_dataset.loc[cars_dataset.condition.str.contains("hours"),'condition'] = (cars_dataset.loc[cars_dataset.condition.str.contains("hours"),'condition'].astype(str).str.split().str[0].astype(int) * 60).astype(str)
cars_dataset.loc[cars_dataset.condition.str.contains("days"),'condition'] = (cars_dataset.loc[cars_dataset.condition.str.contains("days"),'condition'].astype(str).str.split().str[0].astype(int) * 60*24).astype(str)
cars_dataset.condition = cars_dataset.condition.astype(int)

######## dealing with outliers ########

cars_dataset = outlier_processing(cars_dataset)

####### get dummies ########

cars_dataset = pd.get_dummies(cars_dataset, dummy_na=True)


############# Linear regression ##############

    
X = cars_dataset.drop("price",axis=1)
y = cars_dataset["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


reg = LinearRegression().fit(X_train, y_train)

y_pred = reg.predict(X_test)

y_pred = pd.Series(np.absolute(y_pred))

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

#print("coef_pval:\n", stats.coef_pval(reg, X_train, y_train))

print("R_2 = " +str(r2_score(y_test, y_pred)*100)+" %")

scores, pvalues = chi2(X_train, y_train)
scores = pd.Series(scores)
pvalues = pd.Series(pvalues)

sig_p_val = pvalues[pvalues<0.05]
cars_col_index = sig_p_val.index

cars_col_index = pd.Series(cars_col_index)

cars_dataset = cars_dataset.iloc[:,cars_col_index]

print("Retained columns : " +  str(cars_dataset.columns))


X = cars_dataset.drop("price",axis=1)
y = cars_dataset["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


reg = LinearRegression().fit(X_train, y_train)

y_pred = reg.predict(X_test)

y_pred = pd.Series(np.absolute(y_pred))

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

#print("coef_pval:\n", stats.coef_pval(reg, X_train, y_train))

print("R_2 = " +str(r2_score(y_test, y_pred)*100)+" %")
<class 'pandas.core.frame.DataFrame'>
(2499, 10)
   price      brand    model  year   title_status    mileage   color  \
0   6300     toyota  cruiser  2008  clean vehicle 274117.000   black   
1   2899       ford       se  2011  clean vehicle 190552.000  silver   
2   5350      dodge      mpv  2018  clean vehicle  39590.000  silver   
3  25000       ford     door  2014  clean vehicle  64146.000    blue   
4  27700  chevrolet     1500  2018  clean vehicle   6654.000     red   

        state country      condition  
0  new jersey     usa   10 days left  
1   tennessee     usa    6 days left  
2     georgia     usa    2 days left  
3    virginia     usa  22 hours left  
4     florida     usa  22 hours left  
Index(['price', 'brand', 'model', 'year', 'title_status', 'mileage', 'color',
       'state', 'country', 'condition'],
      dtype='object')
price            790
brand             28
model            127
year              30
title_status       2
mileage         2439
color             49
state             44
country            2
condition         47
dtype: int64
          price     year     mileage
count  2499.000 2499.000    2499.000
mean  18767.671 2016.714   52298.685
std   12116.095    3.443   59705.516
min       0.000 1973.000       0.000
25%   10200.000 2016.000   21466.500
50%   16900.000 2018.000   35365.000
75%   25555.500 2019.000   63472.500
max   84900.000 2020.000 1017936.000
       brand model   title_status  color         state country    condition
count   2499  2499           2499   2499          2499    2499         2499
unique    28   127              2     49            44       2           47
top     ford  door  clean vehicle  white  pennsylvania     usa  2 days left
freq    1235   651           2336    707           299    2492          832

-------------
% of outliers

brand           0.000
color           0.000
condition      13.405
country         0.000
mileage         7.563
model           0.000
price           2.561
state           0.000
title_status    0.000
year            5.042
dtype: float64
Mean Absolute Error: 311579947.006698
Mean Squared Error: 8.814536799225571e+18
Root Mean Squared Error: 2968928560.8154283
R_2 = -7089998291544.65 %
Retained columns : Index(['price', 'year', 'mileage', 'brand_buick', 'brand_chevrolet',
       'brand_ford', 'brand_harley-davidson', 'brand_heartland', 'brand_honda',
       'brand_hyundai',
       ...
       'state_nevada', 'state_new hampshire', 'state_new mexico',
       'state_north carolina', 'state_oklahoma', 'state_tennessee',
       'state_texas', 'state_utah', 'state_vermont', 'state_nan'],
      dtype='object', length=106)
Mean Absolute Error: 5376.09284146994
Mean Squared Error: 53382387.46119949
Root Mean Squared Error: 7306.325168044432
R_2 = 57.061721503384064 %
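The IQR rule inside outlier_processing clips values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR to those fences. A minimal sketch on toy prices:

```python
import pandas as pd

# Toy prices including one extreme value.
s = pd.Series([0.0, 10200.0, 16900.0, 25555.0, 84900.0])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Clip to the IQR fences, as outlier_processing does per column.
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(capped.max())
```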
In [42]:
import pingouin as pg
pg.corr(x=df['year'], y=df['price'])
Out[42]:
n r CI95% r2 adj_r2 p-val BF10 power
pearson 2499 0.418 [0.39, 0.45] 0.175 0.174 0.000 3.908e+102 1.000

Time series model of price (SARIMAX)

In [49]:
from statsmodels.tsa.statespace import sarimax
model = sarimax.SARIMAX(df["price"], order=(1,1,1), seasonal_order=(1,1,0,12), enforce_invertibility=True)
results = model.fit()
results.summary()
Out[49]:
Statespace Model Results
Dep. Variable: price No. Observations: 2499
Model: SARIMAX(1, 1, 1)x(1, 1, 0, 12) Log Likelihood -27228.849
Date: Sun, 24 May 2020 AIC 54465.697
Time: 05:03:18 BIC 54488.971
Sample: 0 HQIC 54474.149
- 2499
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.0612 0.027 -2.228 0.026 -0.115 -0.007
ma.L1 -0.8880 0.013 -68.702 0.000 -0.913 -0.863
ar.S.L12 -0.5122 0.020 -25.783 0.000 -0.551 -0.473
sigma2 2.627e+08 1.7e-11 1.54e+19 0.000 2.63e+08 2.63e+08
Ljung-Box (Q): 423.59 Jarque-Bera (JB): 514.60
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.73 Skew: 0.46
Prob(H) (two-sided): 0.00 Kurtosis: 5.03


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 9.63e+33. Standard errors may be unstable.
In [50]:
predictions_int = results.get_forecast(steps=11)
predictions_int.predicted_mean
Out[50]:
2499    8846.634
2500    4541.621
2501    8021.696
2502   13331.860
2503    8188.082
2504    6644.171
2505    5780.873
2506    6583.216
2507    5780.873
2508    2246.882
2509    8188.085
dtype: float64
In [68]:
x = df["year"].values.reshape(-1,1)
y = df["price"].values.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
regressor = LinearRegression()  
regressor.fit(X_train, y_train) #training the algorithm
#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)
[-2989833.12707547]
[[1491.91031767]]
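The intercept and slope printed above come from ordinary least squares; for a single feature the same numbers follow from the closed-form formulas slope = cov(x, y)/var(x) and intercept = mean(y) − slope·mean(x). A sketch on toy year/price values (hypothetical, not the fitted dataset):

```python
import numpy as np

# Toy year/price pairs (hypothetical).
year = np.array([2008.0, 2011.0, 2014.0, 2018.0, 2018.0])
price = np.array([6300.0, 2899.0, 25000.0, 5350.0, 27700.0])
# Closed-form simple linear regression.
slope = np.cov(year, price)[0, 1] / np.var(year, ddof=1)
intercept = price.mean() - slope * year.mean()
# np.polyfit solves the same least-squares problem.
coef = np.polyfit(year, price, 1)  # [slope, intercept]
print(slope, intercept)
```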
In [69]:
y_pred = regressor.predict(X_test)
# keep the comparison in its own frame so the original df is not overwritten
results = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
results
Out[69]:
Actual Predicted
0 7760 19349.984
1 15900 22333.804
2 2900 11890.432
3 29000 16366.163
4 29300 22333.804
... ... ...
495 11500 22333.804
496 23300 20841.894
497 14500 14874.253
498 24900 22333.804
499 17500 17858.073

500 rows × 2 columns

In [72]:
df1 = results.head(19)
df1.plot(kind='bar',figsize=(14,6),color=("limegreen","gold"))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
In [73]:
plt.scatter(X_test, y_test,  color='gold')
plt.plot(X_test, y_pred, color='limegreen', linewidth=2)
sns.set(color_codes=True)
sns.jointplot(x = X_test.ravel(), y = y_test.ravel(), kind="reg");  # jointplot expects 1-D arrays
plt.show()

Using k-fold cross-validation to evaluate the accuracy of our model

In [97]:
from sklearn import svm
svr=svm.SVR(gamma='auto')
svr.fit(X_train, y_train.ravel())
Out[97]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [98]:
y_predict=svr.predict(X_test)
In [99]:
y_predict.shape
Out[99]:
(249,)
In [100]:
r2_score(y_test,y_predict)
Out[100]:
-0.6175677390394947
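A negative R² like this means the SVR predicts worse than a constant model that always outputs the mean of y. Toy values make the baseline explicit:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
# Always predicting the mean gives R^2 = 0 by definition.
baseline = r2_score(y_true, np.full(4, y_true.mean()))
# A predictor worse than the mean goes negative.
worse = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))
print(baseline, worse)
```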
In [103]:
from sklearn import tree

clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train,y_train)
r2_score(y_test, clf.predict(X_test))  # r2_score takes y_true first, then y_pred
Out[103]:
0.9999933017486675
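The heading above mentions k-fold cross-validation, which is not actually run in this cell; a minimal sketch with scikit-learn's cross_val_score on synthetic linear data (hypothetical, not the car dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic year -> price relation with a little noise.
rng = np.random.RandomState(0)
X = rng.uniform(2000, 2020, size=(200, 1))
y = 1500 * X.ravel() - 3_000_000 + rng.normal(0, 1000, size=200)
# Five-fold CV: fit on 4/5 of the data, score R^2 on the held-out fold.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring='r2')
print(len(scores), scores.mean())
```

Averaging the fold scores gives a less optimistic estimate than a single train/test split, since every observation is held out exactly once.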

One-way (one factor) ANOVA

In [161]:
import scipy.stats as stats
# stats.f_oneway takes the groups as input and returns the F statistic and p-value
fvalue, pvalue = stats.f_oneway(df.price,df.mileage)
print(fvalue, pvalue)
757.0150214182844 2.705711374302067e-155
In [172]:
# get ANOVA table as R like output
import statsmodels.api as sm
from statsmodels.formula.api import ols
# reshape the dataframe into long format, as required by statsmodels
d_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['year', 'price', 'mileage'])
# rename columns (the 'Years' column holds the variable name: year/price/mileage)
d_melt.columns = ['index', 'Years', 'value']
# Ordinary Least Squares (OLS) model
model = ols('value ~ C(Years)', data=d_melt).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
Out[172]:
sum_sq df F PR(>F)
C(Years) 3276355677789.422 2.000 1324.119 0.000
Residual 9271448032422.059 7494.000 nan nan
In [173]:
# load packages
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# perform multiple pairwise comparison (Tukey HSD)
m_comp = pairwise_tukeyhsd(endog=d_melt['value'], groups=d_melt['Years'], alpha=0.05)
print(m_comp)
      Multiple Comparison of Means - Tukey HSD, FWER=0.05      
===============================================================
 group1 group2   meandiff  p-adj    lower       upper    reject
---------------------------------------------------------------
mileage  price  -33531.014 0.001  -35863.608   -31198.42   True
mileage   year -50281.9712 0.001 -52614.5652 -47949.3772   True
  price   year -16750.9572 0.001 -19083.5512 -14418.3632   True
---------------------------------------------------------------
In [175]:
print(f"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p = {model.f_pvalue: .4f}")
Overall model F( 2, 7494) =  1324.119, p =  0.0000

Conclusion

  • In this project we analysed the US Cars Dataset, which includes 12 columns. We mainly used five of them: brand (the brand of the car), year (the vehicle registration year), mileage (miles traveled by the vehicle), price (the sale price of the vehicle in the ad), and lot (an identification number assigned to a particular quantity of material from a single manufacturer; for cars, the lot number is combined with a serial number to form the Vehicle Identification Number). The dataset contains information about 28 brands of clean and used vehicles for sale in the US.